Chinese sentence segmentation as comma classification
Authors
Abstract
We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.
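The gap between the accuracy and F1 figures in the abstract can be illustrated with a small calculation. The sketch below is not the paper's evaluation code, and the confusion-matrix counts are hypothetical, chosen only so that the result lands near the reported figures. It shows how a binary comma classifier can reach about 90% accuracy while scoring an F1 of about 70% on the rarer sentence-boundary class.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) for the positive class.

    tp/fp/fn/tn are confusion-matrix counts, where "positive" means
    "this comma marks a sentence boundary".
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts (NOT taken from the paper): 600 commas,
# 100 of which truly mark a sentence boundary.
acc, prec, rec, f1 = accuracy_and_f1(tp=70, fp=30, fn=30, tn=470)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Because most commas in such a sample do not end a sentence, the many correct negatives lift overall accuracy, while F1 counts only performance on the boundary class.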
Similar resources
Segmentation of Chinese Long Sentences Using Commas
The comma is the most common form of punctuation. As such, it may have the greatest effect on the syntactic analysis of a sentence. Chinese is an isolating language, so its sentences offer fewer cues for parsing, and the cues for segmenting a long Chinese sentence are fewer still. However, the average frequency of comma usage in Chinese is higher than in other languages. The comma plays an important role i...
Maximum Entropy for Chinese Comma Classification with Rich Linguistic Features
Discourse relations are an important part of discourse semantic analysis, and the study of punctuation matters for identifying them. In this paper, we propose a method of Chinese comma classification based on maximum entropy (ME). The method classifies the sentence relation signalled by a comma by extracting rich linguistic features from the text before and after the comma. Exper...
Chinese Comma Disambiguation for Discourse Analysis
The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structure-oriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chine...
Dependency parsing for Chinese long sentence: A second-stage main structure parsing method
This paper explores the problem of parsing long Chinese sentences. Inspired by human sentence processing, a second-stage parsing method, referred to as main structure parsing in this paper, is proposed to improve parsing performance on long Chinese sentences while maintaining high accuracy and efficiency. Three different methods have been attempted in this paper and the result shows that ...
A Segmentation Matrix Method for Chinese Segmentation Ambiguity Analysis
Chinese Segmentation Ambiguity (CSA) is a fundamental problem in processing the Chinese language, where a sentence can generate more than one segmentation path. Two techniques are commonly used to identify CSA: Omni-segmentation and Bi-directional Maximum Matching (BiMM). Because of its high computational complexity, Omni-segmentation is difficult to apply to big data. BiMM is easie...